Wine Quality Analysis¶

Created 17-Oct-2024 Mark A. Goforth, Ph.D.¶

Purpose¶

This notebook is designed for EDA and train a DNN model to perform a quality estimation of wine by it's chemical composition.

Goal¶

Challenges & Discussion¶

General Steps for Approach¶

  1. Download data

    • wine quality data is downloaded from kaggle
  2. EDA

    • identify independent variables that influence the outcome
  3. Feature Engineering

    • normalize and standardize independent variables as necessary
    • reduce dimensionality
  4. Train/Test Split

    • split data for training and final testing to see how performance will be in the real world
    • use random shuffle and stratified split to preserve proportions of classes
  5. Model Selection, Cross Validation, and Tuning

    • use K-fold cross validation to reduce bias, build more generalized model and prevent overfitting
    • apply hyperparameter tuning to search for best settings that provide improved bias and variance
  6. Model Validation

    • run model on test set to see how model will perform on real world data
  7. Create GAN (TBD)

    • create a Generative Adversarial Network (GAN) deep learning architecture
    • train two neural networks to compete against each other to generate more authentic new data from a given training dataset
  8. Create VAE (TBD)

    • create a Variational Autoencoder (VAE) deep learning architecture
    • train neural network to use in anomaly detection

Conclusion¶

Install any necessary python packages¶

In [ ]:
!pip install kagglehub
In [ ]:
!pip install tensorflow
In [ ]:
!pip install keras_tuner

Import Libraries¶

In [2]:
import datetime
import time

import numpy as np 
import pandas as pd 
import seaborn as sns 
import scipy.stats as stats
import statsmodels.api as sm

# import pylab as plt
from IPython.display import Image
from IPython.core.display import HTML 
from pylab import rcParams

import sklearn
from sklearn import decomposition
from sklearn.decomposition import PCA
from sklearn import datasets

import kagglehub
import ppscore as pps

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold, GroupShuffleSplit
from sklearn import metrics
from sklearn.metrics import confusion_matrix

import pickle

import tensorflow as tf
from tensorflow.keras import datasets, layers, models, losses
from matplotlib import pyplot as plt

import keras_tuner
import keras

Download latest dataset version¶

In [3]:
pathstr = kagglehub.dataset_download("adarshde/wine-quality-dataset")
print("Path to dataset files:", pathstr)
df = pd.read_csv(pathstr+'/winequality-dataset_updated.csv')
# df = df.drop_duplicates()
Path to dataset files: /Users/Mark/.cache/kagglehub/datasets/adarshde/wine-quality-dataset/versions/3

Exploratory Data Analysis (EDA)¶

In [4]:
df.head()
Out[4]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.3 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.2 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1999 non-null   float64
 1   volatile acidity      1999 non-null   float64
 2   citric acid           1999 non-null   float64
 3   residual sugar        1999 non-null   float64
 4   chlorides             1999 non-null   float64
 5   free sulfur dioxide   1999 non-null   float64
 6   total sulfur dioxide  1999 non-null   float64
 7   density               1999 non-null   float64
 8   pH                    1999 non-null   float64
 9   sulphates             1999 non-null   float64
 10  alcohol               1999 non-null   float64
 11  quality               1999 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 187.5 KB
In [6]:
df.describe().T.style.background_gradient(axis=0)
Out[6]:
  count mean std min 25% 50% 75% max
fixed acidity 1999.000000 8.670335 2.240023 4.600000 7.100000 8.000000 9.900000 15.900000
volatile acidity 1999.000000 0.541773 0.180381 0.120000 0.400000 0.530000 0.660000 1.580000
citric acid 1999.000000 0.246668 0.181348 0.000000 0.110000 0.200000 0.385000 1.000000
residual sugar 1999.000000 3.699090 3.290201 0.900000 2.000000 2.300000 3.460000 15.990000
chlorides 1999.000000 0.075858 0.048373 0.010000 0.056000 0.075000 0.086000 0.611000
free sulfur dioxide 1999.000000 20.191096 15.642224 1.000000 9.000000 16.000000 27.000000 72.000000
total sulfur dioxide 1999.000000 52.617809 37.051121 6.000000 24.000000 42.000000 73.000000 289.000000
density 1999.000000 0.996477 0.002110 0.990070 0.995265 0.996600 0.997800 1.003690
pH 1999.000000 3.290140 0.274297 2.340000 3.180000 3.300000 3.420000 4.160000
sulphates 1999.000000 0.949465 0.780523 0.330000 0.560000 0.650000 0.840000 3.990000
alcohol 1999.000000 10.671161 1.369932 8.400000 9.500000 10.400000 11.400000 15.000000
quality 1999.000000 5.637819 1.255574 2.000000 5.000000 6.000000 6.000000 9.000000

Attribute Information¶

Feature Explain
fixed acidity most acids involved with wine or fixed or nonvolatile
volatile acidity the amount of acetic acid in wine
citric acid the amount of citric acid in wine
residual sugar the amount of sugar remaining after fermentation stops
chlorides the amount of salt in the wine
free sulfur dioxide the amount of free sulfur dioxide in the wine(those available to react and thus exhibit both germicidal and antioxidant properties)
total sulfur dioxide amount of free and bound forms of SO2
density the measurement of how tightly a material is packed together
PH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4
Alcohol the percent alcohol content of the wine
quality output variable (based on sensory data, score between 3 and 8)

check for missing values¶

In [7]:
df.isna().sum()
Out[7]:
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

Visualization - create histograms for each independent variable¶

In [8]:
for i in df.columns:
    plt.figure(figsize=(6, 4)) 
    sns.histplot(data=df[i])
    plt.title(f'{i}')
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Visualization - create box plots¶

In [9]:
columns = list(df.columns)
fig, ax = plt.subplots(11, 2, figsize=(15, 45))
plt.subplots_adjust(hspace = 0.5)
for i in range(11) :
    # ax 1
    sns.boxplot(x=columns[i], data=df, ax=ax[i, 0])
    # ax 2
    sns.scatterplot(x=columns[i], y='quality', data=df, hue='quality', ax=ax[i, 1])
No description has been provided for this image

compare each dependent variable with quality using box plots¶

In [10]:
for i in df.columns:
    if i != 'quality':
        plt.figure(figsize=(6, 4))  # Set figure size for each plot
        sns.boxplot(data=df, x='quality', y= i)
        plt.title(f'Box plot for quality and {i}')
        plt.tight_layout()
        plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [11]:
for i in df.columns:
    if i != 'quality':
        plt.figure(figsize=(6, 4))
        sns.violinplot(data=df, x='quality', y=i)
        plt.title(f'Violin plot for {i} by Quality')
        plt.tight_layout()
        plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Correlate each dependent variable with quality¶

In [12]:
%matplotlib inline
rcParams['figure.figsize'] = 12, 10
sns.set_style('whitegrid')
In [13]:
# Plotting the correlation heatmap
dataplot = sns.heatmap(df.corr(), cmap="YlGnBu", annot=True, annot_kws={"size": 12})

# Displaying heatmap
plt.show()
No description has been provided for this image
In [14]:
rcParams['figure.figsize'] = 15, 15
sns.pairplot(df, hue='quality', corner = True, palette='Blues')
Out[14]:
<seaborn.axisgrid.PairGrid at 0x16b3772f0>
No description has been provided for this image
In [15]:
# Plot the top N components
dfc = df.corr().iloc[:-1,-1:].sort_values(by='quality', ascending=True)
# dfc = df.corr().iloc[-1:,:-1].sort_values(by='quality', ascending=False).transpose()
# dfc = dfc.set_index('Source').rename_axis(None)
# dfc = df.corr().iloc[:-1,-1:].sort_values(by='quality', ascending=False).transpose()
# type(dfc)
# dfc.loc['quality'].plot(kind='bar', figsize=(10,4) )
dfc.plot.barh(figsize=(10,4) )
Out[15]:
<Axes: >
No description has been provided for this image

Prepare data for machine learning training¶

In [16]:
X = df.drop('quality', axis=1)
variable_names = X.columns
In [17]:
variable_names
Out[17]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol'],
      dtype='object')
In [18]:
X.head()
Out[18]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 7.3 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8
4 7.2 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4
In [19]:
pca = decomposition.PCA()
wine_pca = pca.fit_transform(X)
explained_variance = pca.explained_variance_ratio_
In [20]:
comps = pd.DataFrame(pca.components_, columns=variable_names)
comps
Out[20]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol
0 0.001846 0.000523 -0.000357 0.028026 -0.000190 0.247027 0.968584 -0.000004 -0.000489 0.005685 0.001142
1 0.015116 0.000233 -0.002384 0.080831 -0.000815 0.964725 -0.248540 -0.000020 -0.001492 0.017462 0.021411
2 0.339973 0.000540 0.002002 0.922064 -0.003746 -0.088645 -0.005505 0.000027 -0.013114 0.108311 0.120032
3 -0.937027 0.011934 -0.036213 0.338959 -0.000910 -0.015761 -0.003945 -0.000321 0.029162 -0.023276 0.063061
4 0.011237 -0.012509 0.005594 -0.144864 -0.007441 -0.010352 0.005077 -0.000565 0.002478 0.099027 0.984225
5 -0.059325 0.021506 -0.046060 -0.080693 -0.005740 -0.008183 -0.001167 -0.000347 -0.012838 0.987419 -0.110104
6 0.039158 0.135308 -0.195042 -0.000585 -0.023180 0.000408 0.000164 -0.000339 0.970346 0.002679 -0.000589
7 0.024627 0.774559 -0.587061 -0.008610 -0.026442 -0.001231 0.000005 -0.001092 -0.227505 -0.044762 0.016495
8 -0.019086 0.614831 0.774360 0.003550 0.126313 0.001359 -0.000541 0.003964 0.073640 0.023861 0.002542
9 0.004182 -0.054479 -0.119156 0.001617 0.991306 0.000169 0.000061 0.003609 0.007158 0.002694 0.007384
10 0.000223 0.001345 0.003371 0.000054 0.004122 -0.000002 -0.000002 -0.999985 0.000231 -0.000237 -0.000516
In [21]:
rcParams['figure.figsize'] = 10, 10
sns.heatmap(comps, cmap='Blues', annot=True )
Out[21]:
<Axes: >
No description has been provided for this image
In [22]:
# Plot the top N components
maxcol = np.argmax(pca.components_, axis=1)
n_components = 5  # Number of top components to display
rcParams['figure.figsize'] = 10, 4
plt.bar(range(0, n_components ), explained_variance[:n_components])
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Top {} Principal Components'.format(n_components))
plt.xticks(np.arange(5), variable_names[maxcol[0:5]])
plt.show()
No description has been provided for this image
In [23]:
ppscore_list = [pps.score(df, colName, 'quality') for colName in variable_names]
df_pp_score = pd.DataFrame(ppscore_list).sort_values('ppscore', ascending=False)
df_pp_score
Out[23]:
x y ppscore case is_valid_score metric baseline_score model_score model
10 alcohol quality 0.024893 regression True mean absolute error 0.925463 0.902425 DecisionTreeRegressor()
1 volatile acidity quality 0.002654 regression True mean absolute error 0.925463 0.923007 DecisionTreeRegressor()
2 citric acid quality 0.001189 regression True mean absolute error 0.925463 0.924362 DecisionTreeRegressor()
0 fixed acidity quality 0.000000 regression True mean absolute error 0.925463 0.942051 DecisionTreeRegressor()
3 residual sugar quality 0.000000 regression True mean absolute error 0.925463 1.055569 DecisionTreeRegressor()
4 chlorides quality 0.000000 regression True mean absolute error 0.925463 0.961211 DecisionTreeRegressor()
5 free sulfur dioxide quality 0.000000 regression True mean absolute error 0.925463 0.976857 DecisionTreeRegressor()
6 total sulfur dioxide quality 0.000000 regression True mean absolute error 0.925463 0.970894 DecisionTreeRegressor()
7 density quality 0.000000 regression True mean absolute error 0.925463 1.081802 DecisionTreeRegressor()
8 pH quality 0.000000 regression True mean absolute error 0.925463 0.988989 DecisionTreeRegressor()
9 sulphates quality 0.000000 regression True mean absolute error 0.925463 0.993949 DecisionTreeRegressor()

normalize data¶

In [24]:
# Create X from DataFrame and y as Target
X_temp = df.drop(columns='quality')
y = df.quality
In [25]:
scaler = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_temp)
X = pd.DataFrame(scaler, columns=X_temp.columns)
X.describe().T.style.background_gradient(axis=0, cmap='Blues')
Out[25]:
  count mean std min 25% 50% 75% max
fixed acidity 1999.000000 0.360207 0.198232 0.000000 0.221239 0.300885 0.469027 1.000000
volatile acidity 1999.000000 0.288886 0.123548 0.000000 0.191781 0.280822 0.369863 1.000000
citric acid 1999.000000 0.246668 0.181348 0.000000 0.110000 0.200000 0.385000 1.000000
residual sugar 1999.000000 0.185493 0.218038 0.000000 0.072896 0.092777 0.169649 1.000000
chlorides 1999.000000 0.109581 0.080488 0.000000 0.076539 0.108153 0.126456 1.000000
free sulfur dioxide 1999.000000 0.270297 0.220313 0.000000 0.112676 0.211268 0.366197 1.000000
total sulfur dioxide 1999.000000 0.164727 0.130923 0.000000 0.063604 0.127208 0.236749 1.000000
density 1999.000000 0.470379 0.154891 0.000000 0.381424 0.479442 0.567548 1.000000
pH 1999.000000 0.522055 0.150713 0.000000 0.461538 0.527473 0.593407 1.000000
sulphates 1999.000000 0.169253 0.213258 0.000000 0.062842 0.087432 0.139344 1.000000
alcohol 1999.000000 0.344115 0.207566 0.000000 0.166667 0.303030 0.454545 1.000000

stratify split data evenly between classes before training¶

In [26]:
def stratified_sample(df, strata_col, n_rows):
    """
    Creates a stratified sample of a pandas dataframe with equal proportions for each stratum.
    """
    # Get the unique values in the strata column
    strata_vals = df[strata_col].unique()
    
    # Calculate the sample size for each stratum
    sample_size = int(np.ceil(n_rows / len(strata_vals)))
    
    # Sample an equal number of rows from each stratum
    samples = []
    for val in strata_vals:
        stratum = df[df[strata_col] == val]
        sample = stratum.sample(sample_size, replace=True)
        samples.append(sample)
    
    # Concatenate the samples and return the result
    result = pd.concat(samples)
    return result.sample(n_rows, replace=True)
In [34]:
# stratified split data into train/test sets

ratio_train = 0.8
ratio_val = 0.1
ratio_test = 0.1

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=ratio_test, random_state=1, stratify=y)

X_test = X_test.drop(columns='quality')

# rebalance classes
X_train_balanced = stratified_sample( X_train, 'quality', 1500)
y_train_balanced = X_train_balanced['quality']
X_train_balanced = X_train_balanced.drop(columns='quality')

# split train/val with balanced classes
ratio_remaining = 1 - ratio_test
ratio_val_adjusted = ratio_val / ratio_remaining
X_train, X_val, y_train, y_val = train_test_split(X_train_balanced, y_train_balanced, test_size=ratio_val_adjusted, random_state=1, stratify=y_train_balanced) 
In [35]:
df['quality'].value_counts()
Out[35]:
5    735
6    678
7    265
4     98
3     60
9     60
8     59
2     44
Name: quality, dtype: int64
In [36]:
y_train.value_counts()
Out[36]:
5    185
9    174
6    169
7    167
4    166
2    163
3    161
8    148
Name: quality, dtype: int64
In [37]:
y_val.value_counts()
Out[37]:
5    23
9    22
6    21
4    21
2    21
7    21
3    20
8    18
Name: quality, dtype: int64
In [38]:
y_test.value_counts()
Out[38]:
5    74
6    68
7    26
4    10
3     6
9     6
8     6
2     4
Name: quality, dtype: int64
In [ ]:
?? SKIP OLD CODE

# balanced split for train/val sets
df_train_stratified = stratified_sample( df_train, 'quality', 1000)

# stratified balanced split for train/test
X = df_train_stratified.drop(columns='split')
X = df_train_stratified.drop(columns='quality')
X = df_train_stratified.DataFrame(scaler, columns=X_temp.columns)

y = df_train_stratified['quality']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=0, stratify=y )

# test split for final model validation on unseen sample of real data
X_test = df_test.drop(columns='split')
X_test = df_test.drop(columns='quality')
X_test = df_test.DataFrame(scaler, columns=X_temp.columns)
y_test = df_test['quality']
In [39]:
# Convert labels to one-hot encoding
y_traincat = tf.keras.utils.to_categorical(y_train)
y_valcat = tf.keras.utils.to_categorical(y_val)
y_testcat = tf.keras.utils.to_categorical(y_test)
In [41]:
# create the pie chart to show rebalancing
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

df.quality.value_counts().sort_index().plot.pie(ax=ax1)
ax1.set_title('Before Sampling')

# create the second pie chart for rebalanced 
# df_train_stratified.quality.value_counts().sort_index().plot.pie(ax=ax2)
y_train_balanced.value_counts().sort_index().plot.pie(ax=ax2)
ax2.set_title('After Sampling')

# set the title and adjust the layout
fig.suptitle('Class Distribution of Training Data')
fig.tight_layout()

# show the figure
plt.show()
No description has been provided for this image

Train DNN¶

In [42]:
#------------------------------------------------------------------------------
# hyperparameter tuning function
#------------------------------------------------------------------------------
def build_model(hp):

    print(hp)

    inputs = keras.Input(shape=(11, 11, 9))
    x = inputs

    n_layers = hp.Int( "n_layers", 2, 24 )
    nodeunits = hp.Int( 'units', 4, 32 )
    dropout =hp.Float( "dropout",0,0.25)
    learning_rate = hp.Float( "learning_rate", 0.00001, 10 )
    # batch_size = hp.Float( "batch_size", 4, 64 )
    optimizer = hp.Choice( "optimizer", ["adam", "adamax"] )

    # model = keras.Sequential()
    # model.add(keras.layers.Dense(
    #     hp.Choice('units', [8, 16, 32]),
    #     activation='relu'))
    # model.add(keras.layers.Dense(1, activation='relu'))
    # model.compile(loss='mse')

    #--------------------------------------
    # configure model
    #--------------------------------------

    #	number of hidden layers
    #	number of neurons
    #	activation function (relu)
    #	output layer (sigmoid for binary classification; softmax for binary or multiclass)

    # initialize ANN
    ann = tf.keras.models.Sequential()

    # add hidden layers
    for i in range(n_layers):
        ann.add(tf.keras.layers.Dense(nodeunits,activation="relu"))
    
    if dropout > 0:
        x = layers.Dropout(dropout)(x)

    # create output layer (number of units = number of classes)
    # ann.add(tf.keras.layers.Dense(units=1,activation="sigmoid"))
    ann.add(tf.keras.layers.Dense(units=10,activation="softmax"))

    # compile model
    # optimizers:
    #   Adam (good), AdamW, Adadelta, Adagrad, Adamax (good), Nadam, Ftrl (bad), Lion (very noise loss), SGD (good but takes alot of epochs)
    if optimizer == "adamax":
        opt = tf.keras.optimizers.Adamax(learning_rate)
    else:
        opt = tf.keras.optimizers.Adam(learning_rate)

    #	loss function
    #   mse, binary_crossentropy, categorical_crossentropy

    #	metrics (accuracy)

    ann.compile(optimizer=opt,loss="categorical_crossentropy",metrics=['accuracy'])

    return ann
In [43]:
# run start time
print("start time: "+str(datetime.datetime.now()))
starttime = time.time()

numtrials = 100
numepochs = 50

# fitout = ann.fit( X_train, Y_train, batch_size=batchsize, validation_data=(X_test,Y_test), epochs=numepochs )

hp = keras_tuner.HyperParameters()

# hp.values["model_type"] = 

hp.Float(
    "learning_rate",
    min_value=0.001,
    max_value=0.1,
    sampling="log" )

hp.Int(
    "n_layers",
    min_value=3,
    max_value=5 )

hp.Int(
    "units",
    min_value=11,
    max_value=11 )

hp.Float(
    "dropout",
    min_value=0.0,
    max_value=0.05 )

hp.Int(
    "batch_size",
    min_value=4,
    max_value=32 )

hp.Choice(
    "optimizer",
     ["adam"] )

# hyperparameter tuning
dts = str(datetime.datetime.now().isoformat(timespec="seconds"))
dts = dts.replace(":","")
pathout = "./tuner_"+dts
print("output path: "+pathout)
# tuner = keras_tuner.RandomSearch(
tuner = keras_tuner.BayesianOptimization(
    build_model,
    objective='val_loss', # val_accuracy val_loss
    max_trials=numtrials,
    directory=pathout,
    hyperparameters=hp )

tuner.search( X_train, y_traincat, epochs=numepochs, validation_data=(X_val,y_valcat))
tuner.search_space_summary()
tuner.results_summary()

print( datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "  runtime: " + str(round(time.time()-starttime,3)) + " seconds" )
Trial 100 Complete [00h 00m 07s]
val_loss: 1.701892614364624

Best val_loss So Far: 1.4627546072006226
Total elapsed time: 00h 11m 54s
Search space summary
Default search space size: 6
learning_rate (Float)
{'default': 0.001, 'conditions': [], 'min_value': 0.001, 'max_value': 0.1, 'step': None, 'sampling': 'log'}
n_layers (Int)
{'default': None, 'conditions': [], 'min_value': 3, 'max_value': 5, 'step': 1, 'sampling': 'linear'}
units (Int)
{'default': None, 'conditions': [], 'min_value': 11, 'max_value': 11, 'step': 1, 'sampling': 'linear'}
dropout (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.05, 'step': None, 'sampling': 'linear'}
batch_size (Int)
{'default': None, 'conditions': [], 'min_value': 4, 'max_value': 32, 'step': 1, 'sampling': 'linear'}
optimizer (Choice)
{'default': 'adam', 'conditions': [], 'values': ['adam'], 'ordered': False}
Results summary
Results in ./tuner_2024-10-22T065359/untitled_project
Showing 10 best trials
Objective(name="val_loss", direction="min")

Trial 074 summary
Hyperparameters:
learning_rate: 0.0074544302703083674
n_layers: 4
units: 11
dropout: 0.04938495635729518
batch_size: 22
optimizer: adam
Score: 1.4627546072006226

Trial 028 summary
Hyperparameters:
learning_rate: 0.0037947236715404485
n_layers: 4
units: 11
dropout: 0.03210758114077595
batch_size: 4
optimizer: adam
Score: 1.4751256704330444

Trial 049 summary
Hyperparameters:
learning_rate: 0.007144535070604454
n_layers: 4
units: 11
dropout: 0.022585219692721422
batch_size: 7
optimizer: adam
Score: 1.5023205280303955

Trial 018 summary
Hyperparameters:
learning_rate: 0.003821897467082056
n_layers: 4
units: 11
dropout: 0.03211964593722686
batch_size: 4
optimizer: adam
Score: 1.5027107000350952

Trial 041 summary
Hyperparameters:
learning_rate: 0.007590405554212423
n_layers: 4
units: 11
dropout: 0.020283356076484457
batch_size: 7
optimizer: adam
Score: 1.5170587301254272

Trial 071 summary
Hyperparameters:
learning_rate: 0.008151383673141666
n_layers: 4
units: 11
dropout: 0.015563650824836961
batch_size: 6
optimizer: adam
Score: 1.5195058584213257

Trial 097 summary
Hyperparameters:
learning_rate: 0.004387410791857006
n_layers: 5
units: 11
dropout: 0.014184789986907293
batch_size: 17
optimizer: adam
Score: 1.5299832820892334

Trial 027 summary
Hyperparameters:
learning_rate: 0.003846585236459026
n_layers: 4
units: 11
dropout: 0.03209648374814431
batch_size: 5
optimizer: adam
Score: 1.549328088760376

Trial 087 summary
Hyperparameters:
learning_rate: 0.00306300122471149
n_layers: 4
units: 11
dropout: 0.02548569404969314
batch_size: 13
optimizer: adam
Score: 1.5562413930892944

Trial 026 summary
Hyperparameters:
learning_rate: 0.00352808252616922
n_layers: 4
units: 11
dropout: 0.032831012243722535
batch_size: 4
optimizer: adam
Score: 1.5564454793930054
2024-10-22 07:05:53  runtime: 714.114 seconds
In [44]:
# return the best hyperparameters
best_hp = tuner.get_best_hyperparameters()[0]
ann = tuner.hypermodel.build(best_hp)
<keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters object at 0x17d8ab590>
In [45]:
# select the best model
best_model = tuner.get_best_models()[0]
best_model.summary()
<keras_tuner.src.engine.hyperparameters.hyperparameters.HyperParameters object at 0x17d8ab590>
/opt/anaconda3/lib/python3.12/site-packages/keras/src/saving/saving_lib.py:719: UserWarning: Skipping variable loading for optimizer 'adam', because it has 2 variables whereas the saved optimizer has 22 variables. 
  saveable.load_own_variables(weights_store.get(inner_path))
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_3 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_4 (Dense)                 │ (None, 10)             │           120 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 648 (2.53 KB)
 Trainable params: 648 (2.53 KB)
 Non-trainable params: 0 (0.00 B)
In [46]:
numepochs = 50
# fitout = ann.fit( X_train, Y_train, batch_size=batchsize, validation_data=(X_test,Y_test), epochs=numepochs )
fitout = ann.fit( X_train, y_traincat, validation_data=(X_val,y_valcat), epochs=numepochs )

# save model
modelfilename = "ANN.keras"
ann.save(modelfilename)

# load model from file
# ann = models.load_model(modelfilename)

# print metrics
ann.summary()
Epoch 1/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 1s 5ms/step - accuracy: 0.1346 - loss: 3.5397 - val_accuracy: 0.1078 - val_loss: 2.1768
Epoch 2/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.1166 - loss: 2.1740 - val_accuracy: 0.1617 - val_loss: 2.1067
Epoch 3/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.1510 - loss: 2.0800 - val_accuracy: 0.1198 - val_loss: 2.0392
Epoch 4/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.1557 - loss: 2.0303 - val_accuracy: 0.2036 - val_loss: 2.0375
Epoch 5/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.1768 - loss: 1.9928 - val_accuracy: 0.2096 - val_loss: 1.9912
Epoch 6/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2133 - loss: 1.9534 - val_accuracy: 0.1916 - val_loss: 1.9609
Epoch 7/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2025 - loss: 1.9302 - val_accuracy: 0.1856 - val_loss: 1.8807
Epoch 8/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2345 - loss: 1.8579 - val_accuracy: 0.2754 - val_loss: 1.8558
Epoch 9/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2564 - loss: 1.8188 - val_accuracy: 0.2695 - val_loss: 1.8513
Epoch 10/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2698 - loss: 1.7948 - val_accuracy: 0.2515 - val_loss: 1.8529
Epoch 11/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2661 - loss: 1.8533 - val_accuracy: 0.2216 - val_loss: 1.8807
Epoch 12/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2707 - loss: 1.8209 - val_accuracy: 0.2575 - val_loss: 1.8260
Epoch 13/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2993 - loss: 1.7979 - val_accuracy: 0.2695 - val_loss: 1.8186
Epoch 14/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2569 - loss: 1.7935 - val_accuracy: 0.2874 - val_loss: 1.8162
Epoch 15/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2698 - loss: 1.7849 - val_accuracy: 0.3054 - val_loss: 1.8301
Epoch 16/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2711 - loss: 1.7684 - val_accuracy: 0.2874 - val_loss: 1.7973
Epoch 17/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.2883 - loss: 1.7241 - val_accuracy: 0.2575 - val_loss: 1.7818
Epoch 18/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3137 - loss: 1.7333 - val_accuracy: 0.3234 - val_loss: 1.8156
Epoch 19/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3048 - loss: 1.7542 - val_accuracy: 0.2754 - val_loss: 1.7972
Epoch 20/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3174 - loss: 1.7066 - val_accuracy: 0.2335 - val_loss: 1.8081
Epoch 21/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3467 - loss: 1.7086 - val_accuracy: 0.2994 - val_loss: 1.7426
Epoch 22/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3255 - loss: 1.7223 - val_accuracy: 0.3413 - val_loss: 1.7313
Epoch 23/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3071 - loss: 1.6955 - val_accuracy: 0.3174 - val_loss: 1.7917
Epoch 24/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3396 - loss: 1.6738 - val_accuracy: 0.2814 - val_loss: 1.8449
Epoch 25/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3306 - loss: 1.6859 - val_accuracy: 0.2994 - val_loss: 1.7262
Epoch 26/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.3385 - loss: 1.6727 - val_accuracy: 0.3293 - val_loss: 1.7841
Epoch 27/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3709 - loss: 1.6238 - val_accuracy: 0.3174 - val_loss: 1.7263
Epoch 28/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3504 - loss: 1.6484 - val_accuracy: 0.3114 - val_loss: 1.7360
Epoch 29/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3633 - loss: 1.6535 - val_accuracy: 0.3593 - val_loss: 1.7664
Epoch 30/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3620 - loss: 1.6523 - val_accuracy: 0.3473 - val_loss: 1.7593
Epoch 31/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3349 - loss: 1.6692 - val_accuracy: 0.2994 - val_loss: 1.7544
Epoch 32/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3427 - loss: 1.6209 - val_accuracy: 0.2874 - val_loss: 1.8506
Epoch 33/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3087 - loss: 1.7013 - val_accuracy: 0.2994 - val_loss: 1.7154
Epoch 34/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3169 - loss: 1.6478 - val_accuracy: 0.3353 - val_loss: 1.7073
Epoch 35/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3555 - loss: 1.5861 - val_accuracy: 0.3114 - val_loss: 1.7314
Epoch 36/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3585 - loss: 1.6098 - val_accuracy: 0.3353 - val_loss: 1.6845
Epoch 37/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3990 - loss: 1.5710 - val_accuracy: 0.2994 - val_loss: 1.6980
Epoch 38/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3535 - loss: 1.6165 - val_accuracy: 0.3114 - val_loss: 1.6855
Epoch 39/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4082 - loss: 1.5475 - val_accuracy: 0.3234 - val_loss: 1.8166
Epoch 40/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3654 - loss: 1.6093 - val_accuracy: 0.2994 - val_loss: 1.7498
Epoch 41/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3639 - loss: 1.5571 - val_accuracy: 0.3713 - val_loss: 1.6956
Epoch 42/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3915 - loss: 1.5427 - val_accuracy: 0.3234 - val_loss: 1.7115
Epoch 43/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3638 - loss: 1.5470 - val_accuracy: 0.3473 - val_loss: 1.7054
Epoch 44/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3773 - loss: 1.5609 - val_accuracy: 0.3353 - val_loss: 1.6649
Epoch 45/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.4059 - loss: 1.5469 - val_accuracy: 0.3174 - val_loss: 1.7451
Epoch 46/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3632 - loss: 1.5654 - val_accuracy: 0.3234 - val_loss: 1.6650
Epoch 47/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3861 - loss: 1.5364 - val_accuracy: 0.3353 - val_loss: 1.7242
Epoch 48/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3804 - loss: 1.4801 - val_accuracy: 0.3533 - val_loss: 1.6914
Epoch 49/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3698 - loss: 1.5516 - val_accuracy: 0.3114 - val_loss: 1.7252
Epoch 50/50
42/42 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.3806 - loss: 1.5570 - val_accuracy: 0.3593 - val_loss: 1.7052
Model: "sequential_1"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense_4 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_5 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_6 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_7 (Dense)                 │ (None, 11)             │           132 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_8 (Dense)                 │ (None, 10)             │           120 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,946 (7.61 KB)
 Trainable params: 648 (2.53 KB)
 Non-trainable params: 0 (0.00 B)
 Optimizer params: 1,298 (5.07 KB)
In [47]:
# accuracy metrics
history = fitout.history

acc = history['accuracy']
loss = history['loss']
val_acc = history['val_accuracy']
val_loss = history['val_loss']

print("final train accuracy: "+str(acc[-1]))
print("final train loss    : "+str(loss[-1]))

print("final val accuracy: "+str(val_acc[-1]))
print("final val loss    : "+str(val_loss[-1]))

print( datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S") + "  runtime: " + str(round(time.time()-starttime,3)) + " seconds" )
final train accuracy: 0.3810952603816986
final train loss    : 1.5434587001800537
final val accuracy: 0.359281450510025
final val loss    : 1.7051748037338257
2024-10-22 07:08:47  runtime: 887.795 seconds
In [48]:
epochs_range = range(numepochs)

plt.figure(figsize=(10,5))

plt.subplot(1,2,1)
plt.plot( epochs_range, acc, label='Training Accuracy' )
plt.plot( epochs_range, val_acc, label='Validation Accuracy' )
plt.legend( loc='lower right' )
plt.ylim(0,1)
plt.title('Training and Validation Accuracy', fontsize=15 )

plt.subplot(1,2,2)
plt.plot( epochs_range, loss, label='Training Loss' )
plt.plot( epochs_range, val_loss, label='Validation Loss' )
plt.legend( loc='upper right' )
# plt.ylim(0,1)
plt.title('Training and Validation Loss', fontsize=15 )

plt.show()
No description has been provided for this image

Run inference on test data to validate performance¶

In [49]:
# define a function to plot confusion matrix
def plot_confusion_matrix(y_test, y_prediction):
    '''Plotting Confusion Matrix'''
    cm = metrics.confusion_matrix(y_test, y_prediction)
    ax = plt.subplot()
    ax = sns.heatmap(cm, annot=True, fmt='', cmap="Blues")
    ax.set_xlabel('Prediced labels', fontsize=18)
    ax.set_ylabel('True labels', fontsize=18)
    ax.set_title('Confusion Matrix', fontsize=25)
    ax.xaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
    ax.yaxis.set_ticklabels(['Bad', 'Good', 'Middle']) 
    plt.show()
In [50]:
# define a function to plot classification report
def clfr_plot(y_test, y_pred) :
    ''' Plotting Classification report'''
    cr = pd.DataFrame(metrics.classification_report(y_test, y_pred_rf, digits=3,
                                            output_dict=True)).T
    cr.drop(columns='support', inplace=True)
    sns.heatmap(cr, cmap='Blues', annot=True, linecolor='white', linewidths=0.5).xaxis.tick_top()
In [65]:
def clf_plot(y_test, y_pred) :
    '''
    1) Ploting Confusion Matrix
    2) Plotting Classification Report
    '''

    y_predmax = np.argmax(y_pred, axis=1) 
    y_testmax = np.argmax(y_testcat, axis=1)

    # metrics.f1_score(y_test, y_pred, average='weighted', labels=np.unique(y_pred))
    # metrics.f1_score(y_test, y_pred, average='weighted',zero_division=0)

    cm = metrics.confusion_matrix(y_testmax, y_predmax)
    cr = pd.DataFrame(metrics.classification_report(y_testmax, y_predmax, digits=3, output_dict=True)).T
    cr.drop(columns='support', inplace=True)
    
    fig, ax = plt.subplots(1, 2, figsize=(15, 5))
    
    # Left AX : Confusion Matrix
    ax[0] = sns.heatmap(cm, annot=True, fmt='', cmap="Blues", ax=ax[0])
    ax[0].set_xlabel('Prediced labels', fontsize=18)
    ax[0].set_ylabel('True labels', fontsize=18)
    ax[0].set_title('Confusion Matrix', fontsize=25)
    # ax[0].xaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
    # ax[0].yaxis.set_ticklabels(['Bad', 'Good', 'Middle'])
    
    # Right AX : Classification Report
    ax[1] = sns.heatmap(cr, cmap='Blues', annot=True, linecolor='white', linewidths=0.5, ax=ax[1])
    ax[1].xaxis.tick_top()
    ax[1].set_title('Classification Report', fontsize=25)
    plt.show()
In [72]:
# test predict (inference)
y_pred = ann.predict(X_test)
y_predmax = np.argmax(y_pred, axis=1) 
y_testmax = np.argmax(y_testcat, axis=1)
cm = confusion_matrix( y_testmax, y_predmax)
print(cm)

#ann_score = round(ann.score(X_test, y_test), 3)
#print('ANN score : ', ann_score)

clf_plot(y_test, y_pred)
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 839us/step
[[ 0  2  0  0  0  0  2  0]
 [ 1  0  1  1  0  1  2  0]
 [ 2  3  1  2  1  1  0  0]
 [ 0  4  7 30 18 12  3  0]
 [ 0  3  6 13 17 24  4  1]
 [ 0  2  0  1  2 17  4  0]
 [ 2  3  0  0  0  1  0  0]
 [ 0  5  0  0  0  0  1  0]]
No description has been provided for this image
In [ ]: